The aim of this project is to construct a phylogenetic tree of viruses in the order Mononegavirales based on differentiation in their complete genomic sequences obtained from Genbank. The tree will visually iterate the evolutionary relationships between viruses in this large order, including those viruses in several important families whose members are known to be pathogenic to humans. Some such families under the order Mononegavirales are Filoviridae (contains Ebola virus), Rhabdoviridae (includes Rabies virus), and Paromixoviridae (contains Measles and Mumps viruses).
To communicate my final project I will detail how I went about constructing the phylogenetic tree which I built based on the complete genomic sequences of 220 virus isolates in the order Mononegavirales obtained from Genbank
Below I load up all R packages, some very unfamiliar to me before the initiation of this project, into my work space
#CRAN Packages
library(tidyverse)
library(dplyr)
library(readr)
library(kableExtra)
library(ape)
library(seqinr)
library(rentrez)
library(devtools)
library(stringr)
library(rentrez)
library(phangorn)
#Bioconductor packages
library(msa)
library(Biostrings)
library(ggtree)
library(DECIPHER)
library(biomaRt)
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'AnnotationDbi'
#github packages
library(compbio4all)
library(ggmsa)
The next step for me was naturally to create a FASTA file of all complete genome sequences that one can download and view. This is a way to combine all genomic data into a single document (a specialized list in R) that can be re-imported later for further analysis.
I accomplished this process with the rentrez package
using the function entrez_fetch_list(). This function takes
in Genbank accession numbers, connects to an NCBI database
db (in my case the NCBI nucleotide database) and generates
a list of complete genome sequences in FASTA format. The FASTA can then
be saved with the function write.fasta() The file is saved
onto my computer as virus_genomes.FASTA.
If you want to view the FASTA for yourself here is a link to download: complete list of virus genomes (2.9MB)
Instead of copying and pasting or typing out each individual
accession number to fulfill the names argument for the
function which would be excruciating, I simply inserted my previously
made vector of accession numbers, which is saved as the object
virus_accession_vector.
Here is the code to pull complete genome sequences from NCBI by accession numbers and save as a FASTA:
virus_fasta_list <- entrez_fetch_list(db = "nucleotide",
id =virus_list,
rettype = "fasta")
write.fasta(sequences = virus_fasta_list, names = names(virus_accession_vector),
file.out = "./Data/virus_genomes.FASTA")
The next thing I wanted to do was create a data table where the accession number for a virus, species, and respective family could be viewed at once. This was surprisingly tricky. After much frustration, and many unnescessary forloops I was able to accomplish this.
Anyway, here is the (slightly embarassing) code I wrote to acquire the data for my table:
#Assign family names to each taxa IN ORDER of how accession numbers were concatenated (I physically counted how many of each...and double checked)
s1 <- rep("Filoviridae", 12)
s2 <- rep("Artoviridae",7 )
s3 <- rep("Bornaviridae",18)
s4 <- rep("Lispiviridae", 6)
s5 <- rep("Mymonaviridae", 4)
s6 <- rep("Nyamiviridae", 13)
s7 <- rep("Paramyxoviridae",72)
s8 <- rep("Pneumoviridae", 9)
s9 <- rep("Sunshinevirus", 1)
s10 <- rep("Xinmoviridae",8)
s11 <- rep("Rhabdoviridae", 70)
#combine into a vector (same length as virus_accession_vector)
fam <- c(s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11)
#create data frame with accession number and family.
df <- base::data.frame(accession_number=virus_accession_vector, virus_family=fam)
#WE AIN"T DONE YET
#creating vector of organism names by splitting up the FASTA I just created and extracting from there with a billion forloops.
#To tell you the truth I cannot even explain what my thought process was here... just trial error and tears I think. But in the end, it worked!
list_split <- list()
for (i in 1:length(virus_fasta_list)){list_split[[i]] <- str_split(virus_fasta_list[[i]],pattern = ',')
}
virus_names <- list()
for (i in 1:length(list_split)) {virus_names[[i]] <- purrr::map(list_split[[i]],1)
}
virus_species <- list()
virus_1 <- str_split(virus_names, "\\\\")
for (i in 1:length(virus_1)) {virus_species[[i]] <- purrr::map(virus_1[[i]][[1]],1)
}
virus_species <- unlist(virus_species)
virus_species <- str_remove(virus_species, "^...................")
virus_species <- str_remove(virus_species, "..$")
#overwriting my original data frame df, with a new one, where virus species is a column.
df <- data.frame(df, virus_species=virus_species)
The data table of many tears!
## New names:
## Rows: 220 Columns: 4
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (3): accession_number, virus_family, virus_species dbl (1): ...1
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
| …1 | accession_number | virus_family | virus_species |
|---|---|---|---|
| 1 | NC_016144 | Filoviridae | Lloviu cuevavirus isolate Lloviu virus/M.schreibersii-wt/ESP/2003/Asturias-Bat86 |
| 2 | NC_055510 | Filoviridae | Mengla dianlovirus isolate Rousettus-wt/CHN/2015/Sharen-Bat9447-1 |
| 3 | NC_039345 | Filoviridae | Bombali ebolavirus isolate Bombali ebolavirus/Mops condylurus/SLE/2016/PREDICT_SLAB000156 |
| 4 | NC_014373 | Filoviridae | Bundibugyo ebolavirus |
| 5 | NC_004161 | Filoviridae | Reston ebolavirus isolate Reston virus/M.fascicularis-tc/USA/1989/Philippines89-Pennsylvania |
| 6 | NC_006432 | Filoviridae | Sudan ebolavirus isolate Sudan virus/H.sapiens-tc/UGA/2000/Gulu-808892 |
| 7 | NC_014372 | Filoviridae | Tai Forest ebolavirus isolate Tai Forest virus/H.sapiens-tc/CIV/1994/Pauleoula-CI |
| 8 | NC_002549 | Filoviridae | Zaire ebolavirus isolate Ebola virus/H.sapiens-tc/COD/1976/Yambuku-Mayinga |
| 9 | NC_001608 | Filoviridae | Marburg marburgvirus isolate Marburg virus/H.sapiens-tc/KEN/1980/Mt. Elgon-Musoke |
| 10 | NC_024781 | Filoviridae | Marburg marburgvirus isolate Ravn virus/H.sapiens-tc/KEN/1987/Kitum Cave-810040 |
| 11 | NC_055175 | Filoviridae | Wenling frogfish filovirus strain XYHYS28627 nucleoprotein |
| 12 | NC_055176 | Filoviridae | Wenling thamnaconus septentrionalis filovirus strain LQMMTII17328 hypothetical protein |
| 13 | NC_032430 | Artoviridae | Beihai barnacle virus 8 strain BHTH14652 hypothetical protein 1 |
| 14 | NC_032555 | Artoviridae | Beihai rhabdo-like virus 1 strain BHTSS15727 hypothetical protein 1 |
| 15 | NC_032558 | Artoviridae | Beihai rhabdo-like virus 2 strain BHTSS7258 hypothetical protein 1 |
| 16 | NC_055115 | Artoviridae | Hubei rhabdo-like virus 5 strain WHSFII19440 RNA-dependent RNA polymerase gene |
| 17 | NC_055114 | Artoviridae | Hubei rhabdo-like virus 6 strain QTM24798 hypothetical protein and RNA-dependent RNA polymerase genes |
| 18 | NC_055113 | Artoviridae | Hubei rhabdo-like virus 8 strain QTM19395 RNA-dependent RNA polymerase gene |
| 19 | NC_038269 | Artoviridae | Pteromalus puparum negative-strand RNA virus 1 |
| 20 | NC_039013 | Bornaviridae | Jungle carpet python virus |
| 21 | NC_039014 | Bornaviridae | Southwest carpet python virus |
| 22 | NC_055169 | Bornaviridae | Wuhan sharpbelly bornavirus strain DSYS4497 nucleoprotein gene |
| 23 | NC_029642 | Bornaviridae | Aquatic bird bornavirus 1 |
| 24 | NC_030691 | Bornaviridae | Aquatic bird bornavirus 2 |
| 25 | NC_001607 | Bornaviridae | Borna disease virus 1 |
| 26 | NC_030692 | Bornaviridae | Borna disease virus 2 |
| 27 | NC_030690 | Bornaviridae | Canary bornavirus 1 |
| 28 | NC_027892 | Bornaviridae | Canary bornavirus 2 |
| 29 | NC_024296 | Bornaviridae | Canary bornavirus 3 |
| 30 | NC_038268 | Bornaviridae | Estrildid finch bornavirus 1 isolate VS-4707 nucleoprotein (N) |
| 31 | NC_024778 | Bornaviridae | Loveridges garter snake virus 1 strain 251327 |
| 32 | NC_039189 | Bornaviridae | Parrot bornavirus 1 strain M25 N protein (N) |
| 33 | NC_028106 | Bornaviridae | Parrot bornavirus 2 |
| 34 | NC_030688 | Bornaviridae | Parrot bornavirus 4 |
| 35 | NC_030689 | Bornaviridae | Parrot bornavirus 7 |
| 36 | NC_039190 | Bornaviridae | Parrot bornavirus 5 strain 2014-A |
| 37 | NC_030701 | Bornaviridae | Variegated squirrel bornavirus 1 |
| 38 | NC_031259 | Lispiviridae | Lishi Spider Virus 2 strain LSZZ-4 ORF1 (ORF1) |
| 39 | NC_032944 | Lispiviridae | Hubei odonate virus 10 strain QTM133243 hypothetical protein 1 |
| 40 | NC_031270 | Lispiviridae | Tacheng Tick Virus 6 strain TCRP-2 ORF1 (ORF1) |
| 41 | NC_032929 | Lispiviridae | Hubei rhabdo-like virus 3 strain QCM135517 hypothetical protein 1 |
| 42 | NC_033436 | Lispiviridae | Wuchan romanomermis nematode virus 2 strain WCLSXC55347 hypothetical protein 1 |
| 43 | NC_031088 | Lispiviridae | Sanxia Water Strider Virus 4 strain SXSSP12 ORF1 (ORF1) |
| 44 | NC_043483 | Mymonaviridae | Sclerotinia sclerotiorum negative-stranded RNA virus 4 isolate 257 gp5 |
| 45 | NC_032783 | Mymonaviridae | Hubei rhabdo-like virus 4 strain arthropodmix13990 hypothetical protein 1 |
| 46 | NC_025383 | Mymonaviridae | Sclerotinia sclerotiorum negative-stranded RNA virus 1 strain AH98 |
| 47 | NC_026732 | Mymonaviridae | Sclerotinia sclerotiorum negative-stranded RNA virus 3 isolate IL1 gp6 |
| 48 | NC_043486 | Nyamiviridae | Beihai rhabdo-like virus 3 strain BHNXC39077 putative nucleoprotein |
| 49 | NC_032543 | Nyamiviridae | Beihai rhabdo-like virus 4 strain BHJP58499 putative nucleoprotein |
| 50 | NC_032544 | Nyamiviridae | Beihai rhabdo-like virus 5 strain BHJP63888 putative nucleoprotein |
| 51 | NC_032541 | Nyamiviridae | Beihai rhabdo-like virus 6 strain BHJJX49420 putative nucleoprotein |
| 52 | NC_032793 | Nyamiviridae | Wenling crustacean virus 12 strain WLJQ47777 putative nucleoprotein |
| 53 | NC_031275 | Nyamiviridae | Wenzhou Crab Virus 1 strain RBX2 nucleocapsid (N) |
| 54 | NC_012702 | Nyamiviridae | Midway virus |
| 55 | NC_012703 | Nyamiviridae | Nyamanini virus |
| 56 | NC_024376 | Nyamiviridae | Sierra Nevada virus |
| 57 | NC_043485 | Nyamiviridae | Orinoco virus strain UW1 |
| 58 | NC_024702 | Nyamiviridae | Soybean cyst nematode midway virus N-protein |
| 59 | NC_033450 | Nyamiviridae | Wenzhou tapeworm virus 1 strain SGJSC14943 RNA-dependent RNA polymerase gene |
| 60 | NC_033451 | Nyamiviridae | Wenzhou tapeworm virus 1 strain SGJSC14943 hypothetical protein 1 |
| 61 | NC_039017 | Paramyxoviridae | Antarctic penguin virus A |
| 62 | NC_039018 | Paramyxoviridae | Antarctic penguin virus B |
| 63 | NC_039019 | Paramyxoviridae | Antarctic penguin virus C |
| 64 | NC_025407 | Paramyxoviridae | Avian paramyxovirus 11 isolate common_snipe/France/100212/2010 |
| 65 | NC_039230 | Paramyxoviridae | Avian paramyxovirus 2 strain APMV-2/Chicken/California/Yucaipa/56 |
| 66 | NC_040796 | Paramyxoviridae | Avian paramyxovirus 20 isolate APMV-20/gull/Kazakhstan/5976/2014 |
| 67 | NC_025361 | Paramyxoviridae | Avian paramyxovirus 5 strain budgerigar/Kunitachi/74 |
| 68 | NC_039194 | Paramyxoviridae | Avian paramyxovirus 6 strain APMV-6/Goose/FarEast/4440/2003 |
| 69 | NC_025347 | Paramyxoviridae | Avian paramyxovirus 7 strain APMV-7/dove/Tennessee/4/75 |
| 70 | NC_039195 | Paramyxoviridae | Avian paramyxovirus 8 strain goose/Delaware/1053/76 |
| 71 | NC_039223 | Paramyxoviridae | Newcastle disease virus isolate JSD0812 |
| 72 | NC_025363 | Paramyxoviridae | Avian paramyxovirus 12 isolate Wigeon/Italy/3920_1/2005 |
| 73 | NC_025390 | Paramyxoviridae | Avian paramyxovirus 9 strain duck/New York/22/1978 |
| 74 | NC_025373 | Paramyxoviridae | Avian paramyxovirus 3 strain turkey/Wisconsin/68 |
| 75 | NC_039015 | Paramyxoviridae | Avian paramyxovirus 14 isolate APMV14/duck/Japan/11OG0352/2011 |
| 76 | NC_034968 | Paramyxoviridae | Avian paramyxovirus 15 isolate APMV-15/calidris_fuscicollis/Brazil/RS-1177/2012 |
| 77 | NC_019531 | Paramyxoviridae | Avian paramyxovirus 4 strain APMV-4/duck/Delaware/549227/2010 |
| 78 | NC_039016 | Paramyxoviridae | Avian paramyxovirus UPO216 isolate APMV-15/WB/Kr/UPO216/2014 |
| 79 | NC_030231 | Paramyxoviridae | Avian paramyxovirus goose/Shimane/67/2000 viral cRNA |
| 80 | NC_025349 | Paramyxoviridae | Avian paramyxovirus penguin/Falkland Islands/324/2007 |
| 81 | NC_005036 | Paramyxoviridae | Goose paramyxovirus SF02 |
| 82 | NC_007803 | Paramyxoviridae | Beilong virus |
| 83 | NC_002161 | Paramyxoviridae | Bovine parainfluenza virus 3 |
| 84 | NC_001921 | Paramyxoviridae | Canine distemper virus |
| 85 | NC_028362 | Paramyxoviridae | Caprine parainfluenza virus 3 strain JS2013 |
| 86 | NC_025351 | Paramyxoviridae | Cedar virus isolate CG1a |
| 87 | NC_005283 | Paramyxoviridae | Dolphin morbillivirus |
| 88 | NC_039196 | Paramyxoviridae | Feline morbillivirus strain 761U |
| 89 | NC_005084 | Paramyxoviridae | Fer-de-lance virus |
| 90 | NC_025256 | Paramyxoviridae | Bat Paramyxovirus Eid_hel/GH-M74a/GHA/2009 |
| 91 | NC_001906 | Paramyxoviridae | Hendra virus |
| 92 | NC_003461 | Paramyxoviridae | Human parainfluenza virus 1 |
| 93 | NC_001796 | Paramyxoviridae | Human parainfluenza virus 3 |
| 94 | NC_038270 | Paramyxoviridae | Simian Agent 10 |
| 95 | NC_007454 | Paramyxoviridae | J-virus |
| 96 | NC_001498 | Paramyxoviridae | Measles virus |
| 97 | NC_025352 | Paramyxoviridae | Mojiang virus isolate Tongguan1 |
| 98 | NC_005339 | Paramyxoviridae | Mossman virus |
| 99 | NC_043539 | Paramyxoviridae | Mount Mabu Lophuromys virus 1 isolate MOZ135_1 |
| 100 | NC_043540 | Paramyxoviridae | Mount Mabu Lophuromys virus 2 isolate MOZ135_2 |
| 101 | NC_001552 | Paramyxoviridae | Sendai virus genomic RNA |
| 102 | NC_017937 | Paramyxoviridae | Nariva virus |
| 103 | NC_002728 | Paramyxoviridae | Nipah virus |
| 104 | NC_006383 | Paramyxoviridae | Peste des petits ruminants virus complete geno |
| 105 | NC_028249 | Paramyxoviridae | Phocine distemper virus strain PDV/Wadden_Sea.NLD/1988 |
| 106 | NC_055168 | Paramyxoviridae | Pohorje myodes paramyxovirus 1 isolate TT02/05 |
| 107 | NC_025402 | Paramyxoviridae | Porcine parainfluenza virus 1 strain S206N |
| 108 | NC_006296 | Paramyxoviridae | Rinderpest virus (strain Kabete O) |
| 109 | NC_025386 | Paramyxoviridae | Salem virus |
| 110 | NC_025360 | Paramyxoviridae | Atlantic salmon paramyxovirus isolate ASPV/Yrkje371/95 |
| 111 | NC_025355 | Paramyxoviridae | Tailam virus strain TL8K |
| 112 | NC_002199 | Paramyxoviridae | Tupaia paramyxovirus |
| 113 | NC_055167 | Paramyxoviridae | Bank vole virus strain RP-12 |
| 114 | NC_025403 | Paramyxoviridae | Achimota virus 1 |
| 115 | NC_025404 | Paramyxoviridae | Achimota virus 2 |
| 116 | NC_055508 | Paramyxoviridae | Alston virus isolate Alstonville |
| 117 | NC_038271 | Paramyxoviridae | Bat Paramyxovirus Epo_spe/AR1/DRC/2009 |
| 118 | NC_055459 | Paramyxoviridae | UNVERIFIED: Hervey virus isolate BO6 genomic sequen |
| 119 | NC_003443 | Paramyxoviridae | Human rubulavirus 2 |
| 120 | NC_021928 | Paramyxoviridae | Human parainfluenza virus 4a viral cRNA |
| 121 | NC_009489 | Paramyxoviridae | Mapuera virus |
| 122 | NC_039197 | Paramyxoviridae | Menangle virus isolate Australia/bat/2009/Cedar Grove |
| 123 | NC_002200 | Paramyxoviridae | Mumps orthorubulavirus genomic RNA |
| 124 | NC_006430 | Paramyxoviridae | Parainfluenza virus 5 strain W |
| 125 | NC_009640 | Paramyxoviridae | Porcine rubulavirus |
| 126 | NC_006428 | Paramyxoviridae | Simian virus 41 |
| 127 | NC_025343 | Paramyxoviridae | Sosuga virus isolate 2012 |
| 128 | NC_039198 | Paramyxoviridae | Teviot virus isolate Cedar Grove |
| 129 | NC_004074 | Paramyxoviridae | Tioman virus |
| 130 | NC_025410 | Paramyxoviridae | Tuhoko virus 1 |
| 131 | NC_025348 | Paramyxoviridae | Tuhoko virus 2 |
| 132 | NC_025350 | Paramyxoviridae | Tuhoko virus 3 |
| 133 | NC_039231 | Pneumoviridae | Avian pneumovirus strain LAH A |
| 134 | NC_039199 | Pneumoviridae | Human metapneumovirus isolate 00-1 |
| 135 | NC_001989 | Pneumoviridae | Bovine orthopneumovirus |
| 136 | NC_038272 | Pneumoviridae | Bovine respiratory syncytial virus ATCC51908 |
| 137 | NC_001803 | Pneumoviridae | Respiratory syncytial virus |
| 138 | NC_038235 | Pneumoviridae | Human orthopneumovirus Subgroup A |
| 139 | NC_001781 | Pneumoviridae | Human orthopneumovirus Subgroup B |
| 140 | NC_006579 | Pneumoviridae | Pneumonia virus of mice J3666 |
| 141 | NC_025344 | Pneumoviridae | Pneumovirus dog/Bari/100-12/ITA/2012 |
| 142 | NC_025345 | Sunshinevirus | Sunshine virus |
| 143 | NC_033055 | Xinmoviridae | Hubei diptera virus 11 strain SCM51501 hypothetical protein 1 |
| 144 | NC_055111 | Xinmoviridae | Shuangao Fly Virus 2 strain QSA05 RNA-dependent RNA polymerase (L) gene |
| 145 | NC_031244 | Xinmoviridae | Xincheng Mosquito Virus strain XC1-6 ORF1 (ORF1) |
| 146 | NC_035133 | Xinmoviridae | Culex mononega-like virus 2 strain mos191gb23464 |
| 147 | NC_033027 | Xinmoviridae | Hubei rhabdo-like virus 7 strain QTM26925 hypothetical protein |
| 148 | NC_043484 | Xinmoviridae | Drosophila unispina virus 1 ORF1 |
| 149 | NC_055112 | Xinmoviridae | Bolahun virus variant 2 putative nucleoprotein |
| 150 | NC_032849 | Xinmoviridae | Hubei orthoptera virus 5 strain ZCM94262 hypothetical protein 1 |
| 151 | NC_028246 | Rhabdoviridae | Adelaide River virus isolate DPP61 |
| 152 | NC_025391 | Rhabdoviridae | Almpiwar virus isolate MRM4059 |
| 153 | NC_022755 | Rhabdoviridae | American bat vesiculovirus TFFN-2013 isolate liver2008 |
| 154 | NC_034535 | Rhabdoviridae | Barur virus nucleoprotein |
| 155 | NC_025358 | Rhabdoviridae | Berrimah virus strain DPP 63 |
| 156 | NC_002526 | Rhabdoviridae | Bovine ephemeral fever virus |
| 157 | NC_034550 | Rhabdoviridae | Chaco virus nucleoprotein |
| 158 | NC_020805 | Rhabdoviridae | Chandipura virus isolate CIN 0451 |
| 159 | NC_025397 | Rhabdoviridae | Coastal Plains virus strain DPP53 |
| 160 | NC_028255 | Rhabdoviridae | Cocal virus Indiana 2 |
| 161 | NC_055509 | Rhabdoviridae | Cuiaba virus strain BeAn 227841 |
| 162 | NC_025406 | Rhabdoviridae | Dolphin rhabdovirus isolate pxV1 1992 |
| 163 | NC_038284 | Rhabdoviridae | Durham virus nucleocapsid |
| 164 | NC_022581 | Rhabdoviridae | Eel Virus European X complete genome |
| 165 | NC_009527 | Rhabdoviridae | European bat lyssavirus 1 |
| 166 | NC_009528 | Rhabdoviridae | European bat lyssavirus 2 isolate RV1333 |
| 167 | NC_025253 | Rhabdoviridae | Farmington virus strain CT 114 |
| 168 | NC_025341 | Rhabdoviridae | Fikirini bat rhabdovirus isolate KEN352 |
| 169 | NC_028867 | Rhabdoviridae | Fox fecal rhabdovirus isolate S40 nucleocapsid protein |
| 170 | NC_031988 | Rhabdoviridae | Gannoruwa bat lyssavirus isolate RV3266 |
| 171 | NC_055530 | Rhabdoviridae | Garba virus nucleoprotein |
| 172 | NC_025376 | Rhabdoviridae | Grass carp rhabdovirus V76 |
| 173 | NC_039202 | Rhabdoviridae | Flanders virus nucleoprotein |
| 174 | NC_018629 | Rhabdoviridae | Ikoma lyssavirus |
| 175 | NC_001652 | Rhabdoviridae | Infectious hematopoietic necrosis virus |
| 176 | NC_020806 | Rhabdoviridae | Isfahan virus N gene |
| 177 | NC_034451 | Rhabdoviridae | Kern Canyon virus nucleoprotein |
| 178 | NC_034540 | Rhabdoviridae | Keuraliba virus nucleoprotein |
| 179 | NC_025396 | Rhabdoviridae | Kimberley virus isolate CS368 |
| 180 | NC_034549 | Rhabdoviridae | Klamath virus nucleoprotein |
| 181 | NC_028239 | Rhabdoviridae | Koolpinyah virus isolate DPP819 |
| 182 | NC_028236 | Rhabdoviridae | Kumasi rhabdbovirus |
| 183 | NC_038277 | Rhabdoviridae | Trout rhabdovirus 903/87 nucleocapsid protein |
| 184 | NC_034533 | Rhabdoviridae | Landjia virus nucleoprotein |
| 185 | NC_038276 | Rhabdoviridae | Nishimuro virus viral cRNA for hypothetical proteins |
| 186 | NC_031955 | Rhabdoviridae | Lleida bat lyssavirus isolate RV3208 |
| 187 | NC_020808 | Rhabdoviridae | Aravan virus |
| 188 | NC_003243 | Rhabdoviridae | Australian bat lyssavirus |
| 189 | NC_025251 | Rhabdoviridae | Bokeloh bat lyssavirus isolate 21961 |
| 190 | NC_025377 | Rhabdoviridae | West Caucasian bat virus |
| 191 | NC_020810 | Rhabdoviridae | Duvenhage virus isolate 86132SA |
| 192 | NC_020809 | Rhabdoviridae | Irkut virus |
| 193 | NC_025385 | Rhabdoviridae | Khujand lyssavirus |
| 194 | NC_020807 | Rhabdoviridae | Lagos bat virus isolate 0406SEN |
| 195 | NC_006429 | Rhabdoviridae | Mokola virus |
| 196 | NC_001542 | Rhabdoviridae | Rabies virus |
| 197 | NC_025365 | Rhabdoviridae | Shimoni bat virus |
| 198 | NC_034530 | Rhabdoviridae | Marco virus nucleoprotein |
| 199 | NC_034545 | Rhabdoviridae | Mount Elgon bat virus nucleoprotein |
| 200 | NC_005093 | Rhabdoviridae | Hirame rhabdovirus |
| 201 | NC_034548 | Rhabdoviridae | Oita virus nucleoprotein |
| 202 | NC_020803 | Rhabdoviridae | Perch perhabdovirus isolate PRV nucleoprotein (N) gene |
| 203 | NC_038286 | Rhabdoviridae | Piry virus strain BeAn2423 |
| 204 | NC_025387 | Rhabdoviridae | Scophthalmus maximus rhabdovirus |
| 205 | NC_034529 | Rhabdoviridae | Sena Madureira virus nucleoprotein |
| 206 | NC_008514 | Rhabdoviridae | Siniperca chuatsi rhabdovirus |
| 207 | NC_000903 | Rhabdoviridae | Snakehead rhabdovirus complete geno |
| 208 | NC_002803 | Rhabdoviridae | Spring viraemia of carp virus |
| 209 | NC_025356 | Rhabdoviridae | Pike fry rhabdovirus isolate F4 |
| 210 | NC_025401 | Rhabdoviridae | Sunguru virus isolate Ug#41 |
| 211 | NC_055474 | Rhabdoviridae | Taiwan bat lyssavirus isolate TWBLV/TN/2016 |
| 212 | NC_025371 | Rhabdoviridae | Tench rhabdovirus S64 |
| 213 | NC_020804 | Rhabdoviridae | Tibrogargan virus strain CS132 |
| 214 | NC_007020 | Rhabdoviridae | Tupaia virus |
| 215 | NC_043538 | Rhabdoviridae | Vaprio virus nucleoprotein |
| 216 | NC_025353 | Rhabdoviridae | Vesicular stomatitis Alagoas virus Indiana 3 |
| 217 | NC_001560 | Rhabdoviridae | Vesicular stomatitis Indiana virus |
| 218 | NC_038236 | Rhabdoviridae | Vesicular stomatitis Indiana virus strain 98COE |
| 219 | NC_024473 | Rhabdoviridae | Vesicular stomatitis New Jersey virus isolate NJ1184HDB |
| 220 | NC_000855 | Rhabdoviridae | Viral hemorrhagic septicemia virus Fil3 |
The first thing I need to do to get a good alignment,
is clean my sequences into a useable format. Luckily for me, this is a
fairly straightforward process and there are several good functions to
help like fasta_cleaner from the package
campbio4all. In esssence I need to make all my sequences
into a single character vector of length 220. This means each individual
component of the vector needs to be a single character string of a
complete genome.
To accomplish this first I will use the function
fasta cleaner on each element in virus fasta list:
genome_list <- list()
for(i in 1:length(virus_fasta_list)){
genome_list[[i]] <- fasta_cleaner(virus_fasta_list[[i]], parse = F)
}
Then, I need to create an empty vector of length 220. This will literally be a vector of 220 NA’s. Dont worry. I will fill it up later with sequences!
Note: genome list is the same length as the number of genome
sequences in my FASTA file, which is 220. So saying
virus_vector = rep(NA, 220) and
virus_vector = rep(NA, length(genome_list)) accomplishes
the same thing.
virus_vector <- rep(NA, length(genome_list))
Next, I replace each NA in my vector with a complete sequence using a forloop
for(i in 1:length(virus_vector)){
virus_vector[i] <- genome_list[[i]]}
Last, I add the names of the virus species (that I extracted earlier) to the virus vector:
names(virus_vector) <- virus_species
Heres a glimpse of that long character vector:
glimpse(virus_vector)
## chr [1:220] "CACACAAAAAGAAAAAAAGTTTTTTTAACCTTTTTGTGTCCTTAATACCTTGAAGAATATTAATCTCTTCCCGGAAATCGAGATCGAGATCTGGCTTCAAGTAACACTAAT"| __truncated__ ...
Converting my newly made vector into a Biostrings DNA
stringset. It is important to note that Mononegavirales genomes are RNA
so any sequence extracted is going to be the compliment of the actual
genome. This doesn’t affect any part of in phylogenetic tree
construction as it is based on nucleotide differentiation between
genomes and because their nucleotides are all complimented the
differences would be same. The original sequences pulled from Genbank
were actually in DNA format (using A, C, T, G), and so they too are the
compliment of the viral genomes.
Here is the code to construct the stringset:
virus_ss <- Biostrings::DNAStringSet(virus_vector)
Here is the stringset:
virus_ss
## DNAStringSet object of length 220:
## width seq names
## [1] 18927 CACACAAAAAGAAAAAAAGTTT...GAATTAAGAAAAAACTAGTTGT Lloviu cuevavirus...
## [2] 18300 GAGGAATATTAATTTTTATAAG...GTGAAAGAATATTAAGAAAAAA Mengla dianloviru...
## [3] 19043 CGGACACACAAAAAGAAAGAAG...TTTTTTCCTTTTTGTGTGTCCT Bombali ebolaviru...
## [4] 18940 CGGACACACAAAAAGAATGAAG...TTTTCTCTTTTTTGTGTGTCCA Bundibugyo ebolav...
## [5] 18891 CGGACACACAAAAAGAAAAAAG...ATTTTTTCCTTTTTGTGTGTCC Reston ebolavirus...
## ... ... ...
## [216] 11070 ACGAAGACAAACAAACCATTTA...TAGCTGGTTTTGTTGTTTCCGT Vesicular stomati...
## [217] 11161 ACGAAGACAAACAAACCATTAT...TATCTGGTTTTGTGGTCTTCGT Vesicular stomati...
## [218] 11162 ACGAAGACAAACAAACCATTAT...TATCTGGTTTTGTGGTCTTCGT Vesicular stomati...
## [219] 11123 ACGAAGACAAAAAAACCATTAT...GAATTGGTTTTTTTGTCTTCGT Vesicular stomati...
## [220] 11158 GTATCATAAAAGATGATGAGTT...TGGTATGTATTATTTTCTATAC Viral hemorrhagic...
How genomes align relative to each other give the information needed to construct a phylogenetic tree. Differences in the alignment from one genome to another reflect evolutionary differentiation. This evolutionary differentiation allocates to the ‘distance’ between two genomes. A phylogeny is an informative diagram of virus’s evolutionary relationships or distances to and from each other, less genomically ‘distant’ viruses are placed closer together on the tree while more distant viruses are placed farther apart on the tree. Positions in the tree are assigned by best fit models that can interpret genomic alignments.
Virus alignment was performed using the alignseqs()
function from the DECIPHER package.
Here is the code for the alignment:
virus_align <- DECIPHER::AlignSeqs(virus_ss)
Here is a peek of the alignment:
virus_align
## DNAStringSet object of length 220:
## width seq names
## [1] 47301 ----------------------...---------------------- Lloviu cuevavirus...
## [2] 47301 ---------GAGGAATATTAAT...---------------------- Mengla dianloviru...
## [3] 47301 -------------------CGG...---------------------- Bombali ebolaviru...
## [4] 47301 -------------------CGG...---------------------- Bundibugyo ebolav...
## [5] 47301 -------------------CGG...---------------------- Reston ebolavirus...
## ... ... ...
## [216] 47301 ----------------------...---------------------- Vesicular stomati...
## [217] 47301 ----------------------...---------------------- Vesicular stomati...
## [218] 47301 ----------------------...---------------------- Vesicular stomati...
## [219] 47301 ----------------------...---------------------- Vesicular stomati...
## [220] 47301 -------------------GTA...---------------------- Viral hemorrhagic...
By default the alignseqs() function creates a
DNAstringSet object. Although I can use this object to
create basic phylogenetic trees with distance based methods, I will need
to convert to a DNAMultipleAlignment object moving forward
to create more complex and representative trees using model optimization
and other methods.
This is an easy conversion. I will keep virus_align as a
DNAStringSet and create a DNAMultipleAlignment
called `Multiple_virus_align, so I can still utilize both.
multiple_virus_align <- DNAMultipleAlignment(virus_align)
Heres a look at the multiple alignment:
multiple_virus_align
## DNAMultipleAlignment with 220 rows and 47301 columns
## aln names
## [1] -----------------------CA...------------------------ Lloviu cuevavirus...
## [2] ---------GAGGAATATTAATTTT...------------------------ Mengla dianloviru...
## [3] -------------------CGGACA...------------------------ Bombali ebolaviru...
## [4] -------------------CGGACA...------------------------ Bundibugyo ebolav...
## [5] -------------------CGGACA...------------------------ Reston ebolavirus...
## [6] -------------------CGGACA...------------------------ Sudan ebolavirus ...
## [7] -------------------CGGACA...------------------------ Tai Forest ebolav...
## [8] -------------------CGGACA...------------------------ Zaire ebolavirus ...
## [9] ---------------AGACACACAA...------------------------ Marburg marburgvi...
## ... ...
## [212] -------------------------...------------------------ Tench rhabdovirus...
## [213] -------------------------...------------------------ Tibrogargan virus...
## [214] -------------------------...------------------------ Tupaia virus
## [215] -------------------------...------------------------ Vaprio virus nucl...
## [216] -------------------------...------------------------ Vesicular stomati...
## [217] -------------------------...------------------------ Vesicular stomati...
## [218] -------------------------...------------------------ Vesicular stomati...
## [219] -------------------------...------------------------ Vesicular stomati...
## [220] -------------------GTATCA...------------------------ Viral hemorrhagic...
The function ggmsa gives a more visual representation of
the genome alignment. Here I have selected a segment from nucleotide
2060 - 2080, a region of much differentiation between viruses. This goes
to show how different the genomes of viruses, even within the same
families, and orders can be at strategic locations. The colored boxes
reflect where nucleotides overlap between sequences and species.
ggmsa(multiple_virus_align, start = 2060, end = 2080)
At last my friends, we begin to make phylogenetic trees!
To begin with we will construct a basic phylogenetic
tree using neighbor join method NJ. Neighbor join method
takes in a large stringset that has been aligned such as
virus align that I created and uses a distance matrix to
create the tree.
To start my Neighbor Join tree I first need to create a distance
matrix. I will do this with the DECIPHER package using the
`Distance matrix function.
Here is the code:
virus_dist <- DECIPHER::DistanceMatrix(virus_align, type='matrix')
Make it into a distance matrix:
virus_dist <- as.dist(virus_dist)
Here’s a glimpse at the values:
virus_dist[1:10]
## [1] 0.6980494 0.7022852 0.7014216 0.7044919 0.7026391 0.7032094 0.7017479
## [8] 0.7008485 0.6968324 0.7466097
round the values in the matrix:
virus_dist_rounded <- round(virus_dist, digits=3)
A glimpse at the rounded values:
virus_dist_rounded[1:10]
## [1] 0.698 0.702 0.701 0.704 0.703 0.703 0.702 0.701 0.697 0.747
In hindsight the actual values don’t matter that much, its the tree that does the explaining!
To create the tree we use the function nj() from the
package ape and insert the non rounded distances
virus_dist
treeNJ <- nj(virus_dist)
Here’s a zoomed out view of the resulting tree. The point is to be able to see the whole thing all at once! I know it’s huge!
Below you can scroll over the tree to zoom in on organism names
Neighbor joining is not the most accurate or representative method of creating a phylogenetic tree however. To construct my best phylogenetic tree I will use a different method than a basic neighbor join.
To build the best phylogenetic tree possible I will use an optimized model. By selecting a best fit model for my data to base my tree upon I will be able to build the most accurate phylogenetic tree possible.
To begin with I will import cleaned aligned sequences (from earlier)
that I previously saved into a FASTA file with the function
write.fasta. For the purposed of this assignment I will not
show the code to write the FASTA as I feel it is unnecessary, but I will
show importing the FASTA. I will save the imported sequences as the
object viruses for the purpose of identifying that the
aligned sequences are in FASTA format.
viruses <- read.phyDat(file = "./Data/virus_align.FASTA",
format = "fasta")
From here I can construct several different trees like a
upgma based on distance methods dm or a tree
using pratchet. However these would still be relatively
simple trees. And I want a better one
One way to measure how ‘good’ your phylogenetic tree is is with a parsimony score. The lower the parsimony score, the more representative or your total data set the tree is. The parsimony score is a measure of how many deletions or changes had to be made to your data (in my case 220 genome sequences of 8,000 - 20, 000 nucleotides each) in order to construct your tree.
to calculate the parsimony score for my previous tree
treeNJ I can simply run the function parsimony
from the phangorn package compared to my sequence data
viruses
parsimony(treeNJ, viruses)
## [1] 1031966
As you can see, over 1 million changes had to be made to my data to create this tree. In other words this tree SUCKS lets see if we can make it suck less :)
First, lets see what the parsimony scores would be for
upgma pratchet and even
random additon methods, just for curiosity.
Building upgma random addition and
pratchet trees using phanghorn.
dm <- dist.ml(viruses)
#building the same distance matrix I made before, just in a different way.
treeUPGMA <- upgma(dm)
treeRA <- random.addition(viruses)
treeRatchet <- pratchet(viruses, trace = 0, minit=100)
And finding their parsimony scores:
parsimony(c(treeUPGMA,treeRA, treeRatchet), viruses)
## [1] 942929 939453 937443
These are down to under 1 million. Which might be about the best we can do… it looks like treeRAtchet is the best so far. Let’s see if we can do any better using a maximum likelihood approach by selecting a best fit model.
To build a better tree the modeltest mt function from
the Phangorn package.This is extremely computationally
heavy:
mt <- modelTest(viruses)
This gives a list of possible models, and their performance values (like AIC). Here’s the first 10 rows of the resulting data frame. (it is over 92 rows long)
mt[1:10,]
## Model df logLik AIC AICw AICc AICcw BIC
## 1 JC 437 -3130390 6261654 0 6261662 0 6265484
## 2 JC+I 438 -3130390 6261656 0 6261664 0 6265495
## 3 JC+G(4) 438 -3049505 6099886 0 6099894 0 6103724
## 4 JC+G(4)+I 439 -3049505 6099888 0 6099896 0 6103735
## 5 F81 440 -3125273 6251425 0 6251434 0 6255282
## 6 F81+I 441 -3125273 6251427 0 6251436 0 6255292
## 7 F81+G(4) 441 -3040042 6080966 0 6080974 0 6084831
## 8 F81+G(4)+I 442 -3040042 6080968 0 6080976 0 6084841
## 9 K80 438 -3119213 6239303 0 6239311 0 6243141
## 10 K80+I 439 -3119213 6239305 0 6239313 0 6243152
This next function takes over 24 hours to run. It selects at
optimized model from mt to create a the most representative
tree possible from my data.
fit_mt <- pml_bb(mt, control = pml.control(trace = 0))
Here is the output of fit_mt. It shows the model
selected and relative rates of each nucleotide which is pretty
informative!
fit_mt
## model: GTR+G(4)
## loglikelihood: -3017233
## unconstrained loglikelihood: -419643.9
## Discrete gamma model
## Number of rate categories: 4
## Shape parameter: 1.499813
##
## Rate matrix:
## a c g t
## a 0.000000 2.174429 2.787524 1.404949
## c 2.174429 0.000000 1.518994 3.257664
## g 2.787524 1.518994 0.000000 1.000000
## t 1.404949 3.257664 1.000000 0.000000
##
## Base frequencies:
## a c g t
## 0.3046548 0.2077264 0.2137143 0.2739046
I can access the tree generated by calling:
fit_mt$tree
##
## Phylogenetic tree with 220 tips and 218 internal nodes.
##
## Tip labels:
## Lloviu cuevavirus isolate Lloviu virus/M.schreibersii-wt/ESP/2003/Asturias-Bat86, Pteromalus puparum negative-strand RNA virus 1, Mengla dianlovirus isolate Rousettus-wt/CHN/2015/Sharen-Bat9447-1, Hubei rhabdo-like virus 5 strain WHSFII19440 RNA-dependent RNA polymerase gene, Kern Canyon virus nucleoprotein, Bovine orthopneumovirus, ...
## Node labels:
## 1, 1, 1, 1, 1, 0.994, ...
##
## Unrooted; includes branch lengths.
Now to compare the parsimony scores:
parsimony(c(treeRA, treeNJ, treeRatchet, fit_mt$tree), viruses)
## [1] 939453 1031966 937443 940039
Just based on parsimony score, it appears that treeRA
and treeRatchet may outperform fit_mt$tree.
But that is not necessarily the case. Parsimony score is only based on
changes in the data. Not necessarily how representative of the data the
tree actually is. Because fit_mt$tree was generated from
the best fitting model of over 95 simulated models, I have a gut feeling
it is going to be the best tree, because it most likely is using the
best model. 95 models were tested against the data using the
mt and pml_bb method and only one model was
used with the other methods, so I choose the fit_mt$tree. Basically
parsimony score gives us an idea of how well the tree represents the
data but it isn’t everything, its just one measure and several of the
methods have very close parsimony scores nonetheless. Based on just
parsimony score however treeRatchet is the front
runner.
Another thing that gives fit_mt$tree an advantage is that ultrafast bootstrapping is also conducted by default. Bootstrapping is yet another way to test for model validity. Bootstrap values in essence tell how many times out of 100 the same branch in the same position would be observed with re sampled data, this is expressed as a percentage in decimal form. A boot strap value above .95 is best.
Lets have a look at fit_mt$tree:
This phylogenetic tree shows the evolutionary relationship of viruses in the order Mononegavirales. Nodes (spots where 2 branches/lines diverge) represent that last common ancestor of that species. Viruses that share a more recent common ancestor, are more closely related. Viruses that have a common ancestor farther back (toward the left) on the tree are less related. Thus time should be thought of as passing from the left to the right on this tree. Organisms farther to the right are more recently diverged, and organisms toward the left are less recently diverged.
Below you can scroll over the tree to zoom in on organism names